MOSCATO: a supervised approach for analyzing multi-Omic single-Cell data

Background Advancements in genomic sequencing continually improve personalized medicine, and recent breakthroughs generate multimodal data on a cellular level. We introduce MOSCATO, a technique for selecting features across multimodal single-cell datasets that relate to clinical outcomes. We summarize the single-cell data using tensors and perform regularized tensor regression to return clinically-associated variable sets for each ‘omic’ type. Results Robustness was assessed over simulations based on available single-cell simulation methods, and applicability was assessed through an example using CITE-seq data to detect genes associated with leukemia. We find that MOSCATO performs favorably in selecting network features while also shown to be applicable to real multimodal single-cell data. Conclusions MOSCATO is a useful analytical technique for supervised feature selection in multimodal single-cell data. The flexibility of our approach enables future extensions on distributional assumptions and covariate adjustments. Supplementary Information The online version contains supplementary material available at (10.1186/s12864-022-08759-3).


Background
Classic bulk genetic sequencing involves averaging signature levels across all cells. Different sequencers may sequence different types of molecules such as ribonucleic acid (RNA), proteins, DNA methyl groups, etc. Disease progression, therapy success, and other clinical outcomes often vary among individuals suffering from complex diseases [3,7,17,35], and the heterogeneity in their outcomes may be better understood through the intricacies of a patient's molecular signatures [1,5,24,26]. This has led to an explosive demand for multi-omics which involves integrating multiple types of molecular information in order to have a more Systems Biology approach. For example, in breast cancer patients with resistance to lapatinib therapy, Komurov et al. were able to suggest additional therapy targets by identifying combinations of RNA and proteins responsible for glucose deprivation that was associated with the resilience [18]. Methods for identifying graphs and gene regulatory networks within a single molecular type has been well studied [14,20], however, different methods should be considered when integrating multiple types of molecular information in order to accommodate the between and within molecular relationships [6]. Each molecular type often contains thousands of features, and integrating them creates a higher dimensional problem with more sophisticated relationships both within and between molecular types. For example, the Decomposition of Network Summary Matrix via Instability (DNSMI) method decomposes a matrix of network strengths by fitting a series of models for the expected relationships across molecular types and with the disease outcome [38]. Supervised sparse canonical correlation analysis (SCCA) attempts to optimize the correlation matrix between molecular types through lasso constrained linear combinations of the features and also eliminates features weakly correlated with the outcome [36].
In bulk sequencing experiments, rare cells or smaller cell-types will be diluted due to the averaging across all cells within the sample. This motivated singlecell sequencing techniques where molecular information could then be sequenced on a cell-by-cell basis. While initial protocols were limited to RNA [23,29], newer technology may now sequence multiple types of molecular information within each cell, denoted as multi-modal (or multi-omic) single-cell sequencing. For example, CITEseq simultaneously sequences both cell surface proteins and RNA on each cell of a sample [32]. Although still a growing technology, applications have already been considered using this novel sequencing approach. For example, Kendal et al. utilized CITE-seq technology to compare tendons in healthy individuals to those with tendinopathy [15]. Figure 1 displays an example of single-cell data from each patient.
Many bulk sequencing problems utilize matrix decomposition techniques for various analytical goals where the dimensions correspond to different feature sets (e.g., RNA or proteins). However, single-cell sequencing essentially creates an added dimension for cells that did not previously exist in bulk sequencing. Tensors provide a general framework for organizing high dimensional data, and a matrix is a special case of a two dimensional tensor. There-fore, leveraging tensors for single-cell sequencing is an interesting approach to accommodate the added cellular dimension. For example, ScLRTC and scLRTD utilize tensor decompositions to impute dropouts or missing data found in single-cell sequencing data [25,28]. scTenifold-Net also utilizes tensor decomposition to identify unsupervised gene regulatory networks in single-cell RNA-seq data [27].
This manuscript proposes a novel method, Multi-Omic Single-Cell Analysis using TensOr regression (MOSCATO), for identifying the superstructure of a semidirected graph, or network, within multimodal single-cell data that relates to a disease or phenotypic outcome. Current single-cell methods are often either limited to one subject/experiment, limited to just RNA, unsupervised with no clinical outcome associations, or intended for cell clustering and not feature selection. MOSCATO uses regularized tensor regression to address each of those common limitations found in existing methods. "Preliminaries" section describes preliminary tensor concepts, "Results" section presents the MOSCATO results from a series of simulations and a real data application, "Discussion" and "Conclusions" sections discuss future For each subject in a study, their sample is sequenced in a multimodal single-cell sequencer which returns a dataset for each 'omic' type. These datasets are constructed (rows for cells and columns for features) across all subjects in the study. Each subject also has a disease outcome, and the study goal is to identify patterns across features which relate to the disease outcome work and limitations, and "Methods" section describes the mathematical details behind MOSCATO. MOSCATO is applicable when two 'omic' types of single-cell data are present (e.g., RNA and proteins) with a univariate outcome of interest.

Preliminaries
MOSCATO utilizes regularized tensor regression, and this section describes existing and relevant tensor concepts. "Tensor definitions" section defines tensors and basic tensor operations, and "Tensor regression" section uses the operations and definitions from "Tensor definitions" section to describe tensor regression and regularization techniques.

Tensor definitions
High dimensional data may be organized into a tensor, and a matrix may be thought of as a 2-dimensional tensor. Utilizing familiar tensor notation as provided by Kolda and Bader [16], we let Z ∈ R p 1 ×p 2 ×···×p D denote a Ddimensional tensor where dimension d contains p d variables for d = 1, ..., D. For example, D = 1 denotes a vector and D = 2 denotes a matrix. Many mathematical operations for tensors build on mathematical operations used in matrices. For example, Definition 1 describes outer products between D vectors to create a D-dimensional tensor, where • denotes the Khatri-Rao product.
Then the outer product of those vectors, b 1 • b 2 • · · · • b D , creates a D-dimensional tensor of size p 1 ×p 2 ×· · ·×p D , and each (i 1 , ..., i D ) th element equals D d=1 b d i d .
It may also be convenient to reorganize a tensor into a lower dimensional space by vectorizing or mode-d matricizing the tensor. Definitions 2 and 3 describe these reorganization techniques.
Definition 3 Let Z ∈ R p 1 ×p 2 ×...×p D denote a Ddimensional tensor. Then Z may be reorganized into a matrix through the mode-d matricization operator Similarly as done in matrix operations, it may be of interest to multiply two tensors with comparable dimensions via inner products, as described in Definition 4.

Definition 4
Suppose two tensors B ∈ R p 1 ×···×p D and Z ∈ R p 1 ×···×p D . The inner product may be obtained by Furthermore, it may be of interest to multiply a matrix along the d th dimension of a tensor through d-mode products as described in Definition 5.
Definition 5 Suppose a tensor Z ∈ R p 1 ×···×p d ×···×p D and a matrix U ∈ R q×p d . The d-mode product between Z and U may be expressed as The rank of a matrix denotes the maximum number of linearly independent rows/columns in the matrix. Building on those concepts, the rank of a tensor may be thought of as the maximum number of vectors that can be multiplied and added to replicate the tensor, as shown in Definition 6.

Definition 6 Assume a D-dimensional tensor
The true rank of a tensor may often be difficult to determine due to the high dimensionality, motivating decomposition techniques that estimate vectors for a given rank that approximate the tensor, as shown in Definition 8. where Kolda and Bader present additional details on decomposition and other tensor operations [16].

Tensor regression
Building on notation covered in "Tensor definitions" section, this section will briefly describe tensor regression that was originally presented by Zhou et al. [39]. Tensor regression builds on Generalized Linear Model (GLM) concepts, where we have an outcome y for each subject that follows some exponential family with link function g(·) and mean μ. Classic GLM uses univariate independent variables to predict the outcome, but tensor regression extends those concepts by additionally allowing a predictor tensor. This is accomplished by multiplying the dimensions of the tensor through coefficient vectors that convert the dimensions to a univariate value that may then predict the outcome. Figure 2 shows a simple example with rank-1 tensor regression and D = 2 dimensions for the predictor tensor. Each dimension in the predictor tensor corresponds to a different feature set, and each subject will contain its own D = 2 predictor tensor to be used to predict their univariate outcome y.
Imaging modalities such as magnetic resonance imaging (MRI) present an excellent application for tensor regression models. For example, a particular brain MRI slice may be of interest for an outcome, say, disease status. This image slice can be expressed in an array structure (i.e., a 2-dimensional tensor) where the value at a given pixel denotes the MRI signal associated with that location. Referring to the model shown in Fig. 2, the MRI image slice would correspond to the predictor tensor, Z, and the estimated coefficients would indicate regions of interest associated with the outcome.
Tensor regression may also involve higher rank problems with the more formal representation where U contains the univariate independent variables (e.g., age or sex). A rank-R tensor regression estimates R coefficient vectors for each dimension in the predictor tensor, but for simplicity, in this manuscript we will assume rank-1. The Block Relaxation Algorithm is used to estimate the coefficient vectors with additional details described by Zhou et al. [39]. Zhou et al. [39] also claim that regularization in tensor regression may be accomplished by simply imposing constraints when fitting the models on each dimension. If one naively vectorized the predictor tensor and fit a classic GLM model, it would require estimating D d=1 p d coefficients for the tensor. This approach would not only ignore the inherent structure of the data by treating each element in the tensor as independent with no distinction between the dimensions, it would also attempt to estimate many more coefficients compared to R D d=1 p d coefficients in tensor regression. Consequently, this naive approach may be unrealistic in high dimensional problems given a typically much smaller sample size. This reduction in parameters highlights the benefits of tensor regression. However, it is subject to limitations such as uniqueness and identifiability. For example, suppose a rank-1 model with D = 2, g(y) = β 1 T Zβ 2 . Then for any scalar τ , we could derive an equally optimal model g( where β 1 = τ β 1 and β 2 = β 2 /τ . Additionally, the Block Relaxation Algorithm may converge to a local maxima as opposed to the global maxima when attempting to maximize the log likelihood. Zhou et al. [39] describe measures that may be used to check whether these issues are present in the estimated tensor model.

Results
MOSCATO was applied to both simulated data and real data. Results from the simulations are presented in "Simulation results" section and results from the real data application are presented in "Real data application results" section. To our knowledge there are no appropriate methods that easily match the goals of MOSCATO. Nevertheless, to create a sensible competitor we employed and predictor tensor Z with D = 2 dimensions. Each dimension in the predictor tensor corresponds to a feature set, and each feature set contains its own coefficient vectors. The coefficient vectors in this example, β 1 and β 2 , may be estimated by collecting outcomes and predictor tensors across multiple subjects and applying the Block Relaxation Algorithm for estimation  Table 1 a reasonable, but ad hoc, alternative method. MOSCATO was benchmarked against a competing feature selection technique using area under the receiver operating curve (AUC). In short, AUC methods select features that best predict the outcome according to estimated receiver operating curves (ROCs). AUC selections were based on either Bonferroni adjusted p-values or whether the AUC was less than 0.3 or greater than 0.7.

Simulation results
Two 'omic' types denoted as G and X , along with an outcome, were simulated for each subject, and simulations were performed under 9 different settings accounting for differing number of cells per subject and level of technical noise. For additional details on the simulations, refer to "Simulation details" section. Figure 3 displays the tuned maximum network size for G and X , denoted as max G and max X . In a perfect execution of MOSCATO, max G would be tuned to 10 and max X would be tuned to 15 due to known network sizes in the simulated data. All simulations tuned max G and max X to values greater than the true number of network features, regardless of the simulation setting. For max X , smaller values were tuned as technical noise decreased and number of cells increased, but this trend did not persist when tuning max G . Figure 4 displays the sensitivity and specificity across the 9 simulation settings for data types G and X for the 3 different feature selection methods (MOSCATO, AUC using p-values, and AUC using cutoffs). Sensitivity measures the probability that network features are properly included in the selections, and specificity measures the probability that non-network features are properly Noise in X 1500 The table summarizes the number of features within each latent variable used in the simulations. These latent variables are displayed in Fig. 7. H describes the subset of features within G that relate to the outcome but not any features within X , G describes the subset of features within G belonging to the network, and G describes the subset of features within G related to some features within X but not the outcome. S, X, and X describe similar subsets of features within X where the performance actually degraded as the technical noise reduced, it also selected too many features such that the results were not remotely sparse. This explains that while the sensitivity remained high for all simulations, this is simply due to the fact that nearly all features were selected using that criteria. AUC selections based on cutoffs resulted in opposite issues where it did not select nearly any features and produced poor sensitivity with nearly perfect specificity. MOSCATO reproduced the superstructure of the graph (i.e., network) reasonably well with generally high sensitivity and also limited false positives present. This is especially true when comparing against approaches using the AUC. However, when it is expected that high levels of technical noise are present with limited cells per subject, caution should be used when considering MOSCATO.

Real data application results
Leukemia is a broad disease that encompasses all cancers that occur in blood cells. The 5-year survival rate is about 65% according to data from 2011 to 2017, and about 459,000 people were living with leukemia in 2018 in the United States [12]. Leukemia may be classified based on progression speed where chronic denotes slow progression and acute denotes aggressive progression. In addition to cancer progression, leukemia may be subtyped by the type of cells where the cancer forms. For example, lymphocytic leukemia describes cancer developing from white blood cells and myelogenous leukemia describes cancer developing in blood forming cells within the bone marrow. Although rare, one may have both lymphocytic and myelogenous leukemia which is denoted as mixed phenotype leukemia.
To assess MOSCATO in practice, we applied it to real single-cell data with multiple data types. Limited data is currently available due to the infancy of multimodal single-cell sequencing, so data across multiple studies that all used CITE-seq protocols [32] on bone marrow / peripheral blood cells were used. CITE-seq produces cellular level RNA information and cell surface protein abundance via antibody derived tags (ADT) simultaneously, and our outcome of interest will be leukemia versus healthy patients. Our goal will be to apply MOSCATO to this data in order to obtain a subset of RNA and ADT features associated with leukemia. After combining the data across studies, we have 14 healthy patients and 7 patients with leukemia. Of the 7 leukemia subjects, 1 had chronic lymphocytic leukemia while the other 6 had mixed-phenotype acute leukemia. The studies used and further details are described in "Real data application details" section.
The data was pre-processed and integrated using Seurat . We pre-processed and integrated the data using Seurat version 4.0.3 [10], and this figure displays the UMAP [2] results after the integration. The plots are split by healthy versus leukemia subjects ifold Approximation and Projection (UMAP) [2] plots from the cell clusters established from the integrated data across all 21 subjects. The Seurat workflow normalizes the scRNA-seq data, identifies the highly variable genes, scales the data, performs reciprocal principal component analysis using the highly variable genes, finds the nearest neighbors for each cell, and clusters the cells. After integration, 8 of the cell clusters (clusters 0, 1, 2, 3, 4, 5, 6 and 14 from Fig. 5), 17991 RNA features, and 5 ADTs (CD3, CD4, CD14, CD19, and CD56) were measured across all subjects.
In this application after pre-processing, each subject contained 8 scRNA-seq matrices (one for each cell cluster) where each matrix contained 17991 columns and rows corresponding to the cells within that cluster. In addition to the scRNA-seq datasets, each subject contained another 8 matrices (one for each cell cluster) for their single-cell ADT abundance where each matrix contained 5 columns and rows corresponding to the cells within that cluster (same cells as in the scRNA-seq datasets). Finally, each subject had a binary disease status (healthy or leukemia). The feature selection techniques were applied to the 8 pairs of matrices separately (i.e., each cell cluster treated independently), resulting in 8 different MOSCATO and AUC selection results. The first step of MOSCATO estimates the correlation between each subject's scRNA-seq and single-cell ADT features, resulting 21 (i.e., one for each subject) correlation matrices, each of size 17991 × 5.
Analysis was performed on each cell cluster separately, and the selections were later reviewed for biological relevancy. The rest of this section will focus on the results obtained from cell cluster 0, but the complete results from MOSCATO, AUC selections based on p-values, and AUC selections based on the AUC cutoffs for each of the 8 cell clusters are provided in Additional file 2. In summary, the number of features selected by MOSCATO and AUC cutoffs were similarly sized, but AUC selections based on pvalues resulted in nonsparse feature sets. DAVID [11,31] was used to analyze and organize the gene ontology information from the RNA gene selections. DAVID clusters genes based on common annotations and functional information, and DAVID only allows clustering on gene sets with less than 3000 genes. Since the AUC selections based on p-values resulted in RNA selections well over this 3000 restriction for most cell clusters, we only focused on gene clusters from the MOSCATO and AUC cutoff selections.
MOSCATO selected 96 RNA features within cell cluster 0, and DAVID identified 2 gene clusters. The strongest gene cluster (based on highest enrichment score) used 7 of these 96 genes. This was the most notable gene cluster which included the genes CD3D, CD3E, and CD3G which are part of the KEGG pathway for Human T-cell Leukemia Virus type 1 (HTLV-I) infection (KEGG pathway hsa05166), and HTLV-I infections are a known risk factor for developing adult T-cell leukemia/lymphoma [13]. Additionally, the genes CD2, CD3D, CD3E, CD3G, and CD4 within this gene cluster belong to the KEGG pathway for Hematopoietic cell lineage (hsa04640) which assists in producing blood cells. Given that the disease of interest in this application is based on leukemia (i.e., cancer in tissues which produce blood), it is reassuring that this gene cluster contains genes associated with blood production. Also, genes CD2, CD3D, CD3E, CD3G, KLRB1, CD247, and CD4 within the gene cluster are associated with the gene ontology for the cell surface receptor signaling pathway (GO:0007166) which makes sense given that MOSCATO summarized the single-cell data between RNA and cell surface proteins (i.e., ADTs). Of the 5 cell surface proteins considered, MOSCATO selected the ADT's CD3 and CD4 for this cell cluster. Figure 6 displays the RNA features within the strongest functional DAVID cluster, along with the ADT selections. No features selected by MOSCATO under cell cluster 0 were selected by AUC cutoffs, and although 83 of the 96 MOSCATO RNA selections were also selected by AUC based on p-values, the AUC selections based on pvalues selected nearly half of all RNA features considered. AUC selections based on cutoffs selected 11 RNA features and 0 ADTs for cell cluster 0, and DAVID was not able to discern any gene clusters based on the genes selected.
In conclusion, the selections made by MOSCATO under cell cluster 0 resulted in a concise gene cluster discovered by previously known annotations and functionalities. These functionalities not only related to the disease of interest (i.e., leukemia), it also related to cell surface functionalities. This highlights that MOSCATO not only considers supervised information (e.g., disease versus no disease), it also considers the relationships across data types (e.g., RNA and cell surface proteins). Performing feature selection using solely AUC not only neglects the between data type relationships, it also was not able to return a concise set of genes that related to leukemia. Furthermore, it is arbitrary to select pre-specified AUC cutoffs for selections, and the p-value selections did not produce sparse solutions. Also, AUC selections were based directly on normalized expression/sequenced values, but MOSCATO performs selections based on the similarities between data types. This possibly helps reduce batch effects found in individual subjects by standardizing each value between -1 and 1.

Discussion
MOSCATO was performed on both simulated and real data. The simulations produced fairly accurate results with a sensitivity and specificity close to 1 for many of the simulations, although MOSCATO did not perform as well in situations with high technical noise or low cell counts  [11,31], along with all ADT selections (right labels). DAVID determines gene clusters based on similar gene ontology and annotations. These RNA genes correspond to KEGG pathways for blood cell production (CD2, CD3D, CD3E, CD3G, and CD4) and HTLV-I infection (CD3D, CD3E, and CD3G), as well as genes with ontology information for cell surface receptor signaling (CD2, CD3D, CD3E, CD3G, KLRB1, CD247, and CD4). The label colors display the absolute scaled weight of the coefficient vectors from the tensor regression used in MOSCATO so that higher values correspond to a higher weight put on that gene/protein in the tensor regression per subject. Since MOSCATO calculates its predictor tensor to be the estimated correlation of the datasets per individual, it is unsurprising that cell count would contribute to more accurate correlation estimates. Although not used in this manuscript, covariate adjustments could easily be made in MOSCATO by simply adding them to the tensor regression.
The real data application involved 21 subjects, and given the highly variable nature of single-cell sequencing data, the features selected should be interpreted with some caution. Furthermore, since the data was collected across different studies, only 5 proteins were used that merged across all 21 subjects. As single cell technologies become more available and cost feasible, future experiments should be considered with more consistent proteins across experiments and larger sets of subjects.
MOSCATO currently assumes all cells come from the same cell type, but it might be more interesting to accommodate situations in which multiple cell types are present. We are currently working on higher dimensional applications of MOSCATO in a future manuscript. A reasonable solution could incorporate another step which estimates a similarity matrix for each cell type and includes another dimension to the predictor tensor for cell type. Additionally, MOSCATO was only tested on experiments with two data types, but extensions should be considered in situations where more than two data types are present. This could be accomplished by extending the predictor tensor to accommodate a dimension per data type extracted, although higher dimensional summary measures would need to be considered.
MOSCATO was only tested using Pearson's correlation as the summary measure to construct the predictor tensor Z, but other summaries should be considered. For example, mutual information or inverse covariance matrices might be interesting avenues to explore for future consideration. Graphical lasso [8] is a popular technique to obtain both the nodes and edges of a graph by applying lasso regressions to estimate the inverse of the covariance matrix, and this estimated inverse of the covariance matrix could be explored as the predictor tensor input for MOSCATO.
This study only used rank-1 tensors, but higher ranks could be considered that may unveil other patterns and networks available in the data. The proper rank could be obtained using typical model selection criteria such as cross validation, although interpreting the results may not be as straight forward.
Zhou et al. discuss hypothesis testing via asymptotic normality results [39], and these hypothesis testing schemes could be explored to assess network strength. For example, one could perform a global test whether the model coefficients equal zero for the network selections.
Although MOSCATO returns the superstructure for a graph, it does not provide information on directionality and does not currently consider directional consistency. For example, suppose two genes are positively correlated with each other, but they contain conflicting correlative directions with the outcome. This inconsistency in directionality makes interpreting the results more difficult, and may be mitigated by additionally tuning based on optimizing balance. This concept has been considered in bulk level analyses with a single 'omic' type [34], and it could be considered for future work for multimodal data.

Conclusions
MOSCATO is a statistical technique for analyzing multimodal single-cell data where the study goals are to identify which features within the 'omic' datasets relate to each other and a clinical outcome. MOSCATO was found to perform favorably through a series of simulations and a real data application. Multimodal single-cell data continues to grow in popularity, and feature selection techniques such as MOSCATO may be critical to fully leverage the potential in using the data for highlighting biological markers in complex diseases.

Multimodal network analysis using tensor regression
In classic bulk sequencing, the data contains one record per subject. Supposing n subjects with two data types, bulk sequencing studies would contain two data sets (i.e., a dataset for each data type), G ∈ R n×p and X ∈ R n×q . In single-cell sequencing, there are multiple records per subject where each row corresponds to a cell within the subject. Consequently, for a given subject i with two data types, their single-cell data would contain two datasets G i ∈ R m i ×p and X i ∈ R m i ×q , where m i denotes the number of cells for subject i. Since the number of cells typically differs across subjects (i.e., m i = m i where i = i ), we organize each subject's data into separate datasets (i.e., G i and X i ) as opposed to organizing the input data directly into a 3-dimensional tensor with a dimension for cells. It should also be noted that each subject's data often consists of thousands of cells, and concatenating the single-cell data in long format may be computationally inefficient. Furthermore, we assume each subject i contains a univariate outcome y i for i = 1, ..., n, and we may express the outcomes in a vector as y =[ y 1 , y 2 , ..., y n ] T . For simplicity, we may denote the two data types as G and X without the i subscript, although as described previously, each subject's single-cell data will contain separate matrices for the data types as opposed to expressing data in long format as found in bulk sequencing.
MOSCATO aims to identify a subset of features within G and X that relate to each other and the outcome. In graphical modelling terms, MOSCATO identifies the superstructure of a semi-directed graph with undirected nodes involving features within G and X with some path directed to the outcome y. MOSCATO accomplishes this by imposing elastic net constraints on a tensor regression model [40].
Similarly as in the MRI image example from "Tensor regression" section, multimodal single-cell data contains multi-dimensional data per subject (i.e., features within G and features within X ) with a univariate outcome. This motivates the use of tensor regression for multimodal single-cell data. Additionally, tensor regression not only efficiently accommodates multi-dimensional input data with a univariate outcome, it also handles regularization techniques and allows for additional covariate adjustments (e.g., age, sex, race, etc.). However, tensor regression requires equivalent dimensions for each subject's input tensor, and single-cell sequencing experiments nearly always return differing number of cells. Therefore, to standardize the dimensions across each subject, the first step of MOSCATO involves collapsing the cellular dimension all together by estimating a correlation matrix between their data type matrices, whereρ jk =ĉorr(x ij , g ik ) letting x ij denote the j th feature in X i and g ik denote the k th feature in G i for the i th subject. Although many summary matrices could be considered, such as the inverse of the covariance matrix or mutual information, Pearson's correlation provides a simple interpretation while also standardizing the values within Z i between -1 and 1.
In broad strokes, MOSCATO applies a rank-1 tensor regression on Z to determine the elements in each 'omic' type that are associated with each other and with the outcome. Specifically, using the Z i tensors for i = 1, ..., n to estimate the coefficients, a tensor regression model similar to (5) and depicted in Fig. 2 will be fit with elastic net constraints. The elastic net constraint works to balance by a weighted average between an L 1 -norm and L 2 -norm, where the L 1 -norm truncates small coefficients to zero and the L 2 -norm better handles highly correlated features. In summary, the elastic net constraint typically denoted as λ((1 − α)/2||β|| 2 2 + α||β|| 1 ) in the classical GLM setting will now involve where β X ∈ R q denotes the coefficient vector for X , β G ∈ R p denotes the coefficient vector for G, β X j denotes the coefficient for the j th feature in X , and β G k denotes the coefficient for the k th feature in G. The hyperparameters α X ∈[ 0, 1] and α G ∈[ 0, 1] denote the weights to put on the L 1 -norm constraints for X and G, respectively. The hyperparameter λ in the classical GLM setting denotes the overall weight to put on the constraint, and it may be any positive number from 0 to infinity. Since tuning λ to the proper range may be difficult due to the nontrivial parameter space, we use max X and max G instead to denote the maximum number of non-zero values within β X and β G , respectively. This reparameterization of the constraints drastically simplifies the hyperparameter space and subsequent tuning, as described in "Tuning on stability" section. For some fixed α X , α G , max X , and max G , the tensor regression model will be fit to obtainβ X ∈ R q andβ G ∈ R p . Due to the L 1 -norm truncating small values to zero from the elastic net constraint, only a subset of values withinβ X andβ G will be nonzero. Thus, final network features within data type X will be the set {j :β X j = 0, j = 1, ..., q}, and final network features within data type G will be the set {k :β G k = 0, k = 1, ..., p}. Algorithm 1 summarizes the steps to MOSCATO.

Algorithm 1 Schematic for MOSCATO
Require: X , G, y for i = 1 to n do Z i = Cor(X i , G i ) end for Tune hyperparameters = {α X , α G , max X , max G } using Algorithm 2 Fit g(y) = β 0 + β X • β G , Z , with elastic net penalties using the tuned Network features within X = {j :β X j = 0, j = 1, ..., q} Network features within G= {k :β G k = 0, k = 1, ..., p} Tuning on stability MOSCATO assumes fixed values for α X , α G , max X , and max G which will be tuned using an extension of the Stability Approach to Regularization Selection (StARS) method [22]. Tuning on accuracy, such as by cross validation or Bayesian information criterion, tends to result in overly dense solutions in high dimensional problems with results that are not reproducible [22]. The most extreme scenarios for stability are perfectly stable results from selecting no features (i.e., completely sparse) or selecting all features (i.e., no sparseness). Building on that logic, StARS initializes the parameters to the sparsest solution and gradually relaxes the sparsity until some instability threshold φ is met. Instability is estimated based on subsamples from the data by performing the feature selection under each subsample and summarizing the consistency in results across different subsamples. Although φ may initially be thought of as an arbitrary cutoff between 0 and 1, it may be easily interpreted as the amount of allowable instability. In essence, a smaller φ would imply a more sparse but stable result. The motivation behind allowing some instability as opposed to fixing φ to 0 is to allow some noise to be selected in order to ensure that no true signal is missed in the final feature selection. In statistical terms, this means that StARS prioritizes reducing type II errors.
Although the StARS method was developed for tuning a single sparsity parameter, the four hyperparameters α X , α G , max X , and max G will be tuned using similar logic. Focusing on tuning one dimension at a time, we initialize to a sparse solution with some small max X . Fixing max X , we estimate the instability for a range of α X values between 0 and 1. Select the α X value resulting in the lowest instability, and if that instability is less than φ, increase max X and repeat the process. This continues until the φ instability is hit to select max X and α X . Using the highest max X and corresponding optimal α X with instability less than φ, a similar process is then repeated for tuning max G and α G . In this case with two dimensions, one for X and another for G, we first tune α X and max X for some fixed α G and max G , and then usem ax X andα X when tuning α G and max G . The initial fixed α G and max G may be kept large, suppose α G = 0.5 and max G = floor(p/2) such that an overly sparse G does not impact the stability on X for the first dimension of tuning. This is summarized in Algorithm 2.

Algorithm 2 StARS Method for Tuning Tensor Hyperparameter
Require: G, X , y, grid α , φ, S, c, max G 0 , max X 0 Generate S random samples, each of size c Initialize max G to floor(0.5 * n) and α G = 0.5 Repeat the process to estimatem ax G andα G using optimalm ax X andα X

Simulation details Theoretical details of simulations
Splatter [37] is a popular technique to simulate singlecell RNA-seq (scRNA-seq) data, and it has been shown to mimic distributions from real scRNA-seq data. The general Splatter schematic initiates by simulating a gene mean and then adjusts the gene mean to account for variation in outliers, library size, and dispersion. It then simulates the "observed" scRNA-seq values through a Poisson distribution using the adjusted gene mean, and the values are then randomly truncated to zero to replicate dropouts. Although Splatter realistically portrays scRNA-seq distributions, a few extensions were required in order to simulate multimodal single-cell data with supervised gene networks.
To accomplish this, we leverage latent structures for multimodal supervised networks detailed in Zhang et al. [38]. Figure 7 demonstrates the expected causal relationships within the data containing a supervised multimodal network. In summary, we expect there to be a subset of features within G and X that relate to the outcome but not with each other; a subset of features within G and X that relate to each other but not with the outcome; a subset of features within G and X that are independent of each other and the outcome; and finally the target subgroups of the analysis, subset of network features within G that relate to the network features within X that ultimately relates to the outcome. These expected relationships among these latent components may be represented through a covariance matrix where the off diagonals will be non-zero where relationships exist and zero where independence is expected. More details are provided in the Additional file 1.
To extend Splatter for supervised multimodal networks, we will first simulate the latent values (S, H, X, G, G , X , and y) for each subject using a multivariate normal distribution with zero mean and the latent variables' covariance matrix. Since the latent values were simulated from a multivariate normal, they will each be normally distributed with mean 0. The Splatter simulation assumes the initial gene mean comes from a gamma distribution, so we take the square of the latent value divided by its standard deviation. By doing so, the transformed latent values then become gamma distributed with shape equal to 1/2 and scale equal to 2 times its variance. These transformed latent values will be used as the initial gene means for each subset of features, and the latent value for y will be used as the mean to simulate an observed outcome from a normal distribution. Initial gene means for the noise subset of features are randomly simulated independently. Additional theoretical details may be found in the Additional file 1.
Dispersion is adjusted on a subject level, library size is adjusted on a cellular level within a subject, outliers are adjusted on a feature level within a subject, and dropouts are accounted for on a cellular/feature level within a subject. The Splatter simulation is performed using the transformed gene means (combining all latent components Fig. 7 Latent structure for supervised multimodal networks. From [33] and adapted from [38]. Assume two data types, G and X . S describes the subset of features within X that relate to the disease outcome but not with any features within G; H describes the subset of features within G that relate to the disease outcome but not with any features within X ; G and X describe the network where the subset of features within G relate to the subset of features within X that ultimately relate to the outcome; G and X denote the subset of features that relate to each other but not with the outcome; and finally each data type will have independent noise not related to each other or the outcome within G and X ), and then the features are later separated by data type for each subject.

Simulation specifications
MOSCATO was applied to a series of simulations using the techniques described in the previous section. Simulations were performed under 9 different settings accounting for the average number of cells per subject (250, 500, or 1000) and amount of technical noise (low, moderate, or high). Simulations were replicated 50 times under each simulation setting and each simulation had 100 subjects. G contained 1440 total features where only 10 belonged to the network, and X contained 1555 total features where only 15 belonged to the network. Table 1 describes the total number of features contained within each of the latent variables described in Fig. 7.
For tuning the hyperparameters, we set grid α = {0.2, 0.5, 0.7}, φ = 0.02, R = 50, and c = 50. max G 0 and max X 0 were initially set to 5, although this number may be increased in order to reduce runtimes as long as the instability remains below φ for its initialization. Ideally, MOSCATO would tune max G to 10 and max X to 15 in order to select the proper network size according to Table 1, but this will be unlikely due to the mechanics behind the StARS tuning method described in "Tuning on stability" section which prioritizes reducing type II error over type I error.
In addition to applying MOSCATO, we also applied competing methods using the AUC. Seurat provides a popular single-cell sequencing workflow, and following similar methods used by the authors of Seurat [10], this AUC approach was done using the presto version 1.0.0 R package [19]. Selections using AUC were performed using two different criteria. One criterion was based on whether the Bonferroni adjusted p-value was less than the nominal significance level (set to 0.05) under the null hypothesis that the AUC equals 0.5. Additionally, selection criteria using cutoff values where features with an AUC either less than 0.3 or greater than 0.7 were selected. Since AUC requires categorical outcomes, we use the median of the outcome to binarize it (i.e., if the outcome is less than the median then recode the outcome as '0' , otherwise if the outcome is greater than the median then recode as '1').

Real data application details
Data was combined across multiple studies for healthy and leukemia subjects. The studies used to obtain the data are summarized in Table 2. Only non-perturbed, baseline cells were considered.
Seurat version 4.0.3 [10] was used to normalize the data, cluster cells, and integrate the cell types across subjects.
We applied MOSCATO to each of these cell clusters separately, each with grid α = {0.01, 0.05, 0.1}. Since Seurat clusters cells by maximizing correlation between features, the multicollinearity across features would consequently be high and require more weight on the L 2 -norm (i.e., lower α within the elastic net constraint).
Due to the modest sample size, we tuned the hyperparameters using a subsampling size of 20 (out of 21 total subjects) to estimate stability based on a "leave one out" scheme. Although StARS suggests setting the instability threshold φ to 0.05 for most applications [22], in this application with a small sample size (i.e., only 21 subsamples) and large number of RNA features (i.e., 17991 variables), the estimated instability under sparsest solutions will be much smaller than 0.05. For example, suppose all 21 subsamples select completely disjoint feature sets, but due to the high number of variables in consideration, many variables are consistently excluded from any selections across all of the subsampled results. Since StARS considers both consistency in selections and consistency in exclusions, the estimated instability will be quite small due to the consistency in exclusions despite if the small number of features selected may be completely disjoint across all subsamples. Therefore, φ was set to 0.001.
To compare selections with another method, the MOSCATO results were compared to selections based on the AUC. The AUC approach was done using the presto version 1.0.0 R package [19]. Similarly as was done for MOSCATO, the AUC feature selections were performed on each cell cluster separately. Feature selections were made under two different selection criteria for AUC: if the Bonferroni adjusted p-value was less than 0.05 under the null hypothesis that the AUC equals 0.5 or whether the AUC was less than 0.3 or greater than 0.7. Since the p-value would likely be small in situations with many cells (i.e., large sample sizes producing sensitive p-values for miniscule AUC deviations from 0.5), both a p-value approach and an approach based on the AUC values were considered. The real data application may be reproduced by following the steps provided at https://github.com/lorinmil/ MOSCATOLeukemiaExample.